(54) Title: DATA ROUTING DEVICES



(57) Abstract 



A data routing device is described
which may, for example, be used in
field programmable processing arrays,
field programmable gate arrays and other
reconfigurable logic devices, or which may,
for example, be embodied as corner-turning
memory. The data routing device comprises
at least one connection matrix (40) for
routing data in the device, the connection
matrix including a plurality of memory cells
(42); and means for providing input to and/or
output from said memory cells, wherein said
means includes a tree structure of input paths
to and/or output paths from the memory
cells, and wherein each such path of the tree
structure includes a set of branching choices
(52, 54) along the tree structure.

DATA ROUTING DEVICES



DESCRIPTION



This invention relates to data routing devices, which may, for example, be used in field
programmable processing arrays, field programmable gate arrays and other
reconfigurable logic devices, or which may, for example, be embodied as corner-turning
memory.

The problems with which the present invention (or at least preferred embodiments of it)
are concerned are to enable the device to be constructed with a high density of the
memory and to permit high-speed operation of the memory.

In accordance with the present invention, there is provided a data routing device,
comprising: at least one connection matrix for routing data in the device, the connection
matrix including a plurality of memory cells; and means for providing input to and/or
output from said memory cells, wherein said means includes a tree structure of input
paths to and/or output paths from the memory cells, and wherein each such path of the
tree structure includes a set of branching choices along the tree structure.

The use of such a tree structure reduces the amount of wiring which is necessary,
enabling a high density to be achieved. The device may be constructed so that all of the
paths include the same number of branching choices. Also, at any level in the tree
structure, the number of branches at any branching choice at that level is preferably
equal to the number of branches at the other branching choices at that level. Preferably,
the number of branches at each branching choice is two, four or eight. Furthermore, all
of the paths are preferably of substantially the same length. These features are
particularly advantageous in allowing rapid multiple writes to the memory.

In one embodiment, the connection matrix, or at least one of the connection matrices,
includes a plurality of switches; and the memory cells are operable to store data for
controlling the switches to define the configuration of the interconnections of that
connection matrix. In this case, the connection matrix may be arranged to interconnect
a plurality of processing devices or gate arrays.

In another embodiment, the memory cells of the connection matrix, or at least one of the
connection matrices, are arranged to receive data in one format, to store the data
temporarily, and to output the data in another format. In this case, the memory cells may
be arranged as a corner-turning memory, for example for converting data words between
nibble-serial format and nibble-parallel format.
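
As a rough illustration of the corner-turning idea, the following Python sketch models a square array of nibble cells that is written one row (word) at a time and read one column at a time. The class and method names are hypothetical, and the sketch assumes that a row access transfers all the nibbles of one word together while a column access transfers the same nibble position of several words; it is not a description of the circuitry.

```python
class CornerTurningMemory:
    """Toy model of a square array of 4-bit cells (all names are illustrative)."""

    def __init__(self, size=4):
        self.size = size
        self.cells = [[0] * size for _ in range(size)]

    def write_row(self, row, nibbles):
        """Store one word, supplied as one nibble per column."""
        if len(nibbles) != self.size:
            raise ValueError("a word must supply one nibble per column")
        self.cells[row] = [n & 0xF for n in nibbles]

    def read_column(self, col):
        """Return the nibble at position `col` of every stored word."""
        return [self.cells[row][col] for row in range(self.size)]


mem = CornerTurningMemory()
words = [
    [0x1, 0x2, 0x3, 0x4],
    [0x5, 0x6, 0x7, 0x8],
    [0x9, 0xA, 0xB, 0xC],
    [0xD, 0xE, 0xF, 0x0],
]
for row, word in enumerate(words):
    mem.write_row(row, word)

first_nibbles = mem.read_column(0)   # first nibble of each of the four words
```

Writing by rows and reading by columns is simply a transposition of the stored array, which is the sense in which the memory "turns the corner".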

The tree structure may provide input paths for addressing the memory cells, with each
branching choice being provided by an address decoder which, in response to an input
address from a higher level, is operable to sub-address less than all of the address
decoders or memory cells at the next lower level. In this case, the address decoders may
be operable to sub-address only one or two of the address decoders or memory cells at
the next lower level. In one embodiment, at at least one level of the tree structure, the
or each address decoder is operable to address a selectable number of the address
decoders at the next lower level. This enables multiple simultaneous writes to be made
to patterns of the memory cells, which can be particularly advantageous in the case of
memory cells which are used for configuring, for example, a field programmable gate
or processor array.
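
The effect of letting a decoder address a selectable number of branches can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the tree is taken to have a uniform fan-out of four, and every decoder at a given level is assumed to enable the same set of branches.

```python
from itertools import product


def select_cells(branch_choices, fanout=4):
    """Return the leaf indices reached when each level enables the given branches.

    branch_choices holds one collection of branch numbers per tree level,
    top level first. Enabling several branches at a level selects a pattern
    of cells that can then be written together.
    """
    for level in branch_choices:
        if any(b < 0 or b >= fanout for b in level):
            raise ValueError("branch number out of range")
    leaves = []
    for path in product(*[sorted(set(level)) for level in branch_choices]):
        index = 0
        for branch in path:
            index = index * fanout + branch
        leaves.append(index)
    return leaves


single = select_cells([[2], [1], [3]])              # one branch per level
pattern = select_cells([[0], [0, 1, 2, 3], [1]])    # all branches at level two
```

With one branch enabled per level a single cell is addressed; enabling all four branches at the middle level addresses four cells in one operation.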

Additionally or alternatively, the tree structure may provide input and/or output paths
for data to/from the memory cells, with each branching choice being provided by a
multiplexer which passes data from the next higher level to less than all of the
multiplexers or memory cells at the next lower level and/or which passes to the next
higher level data from less than all of the multiplexers or memory cells at the next lower
level.

Specific embodiments of the present invention will now be described, purely by way of
example, with reference to the accompanying drawings, in which:

Figure 1 shows part of a processor array, illustrating six switching sections and the
locations of six arithmetic logic units;

Figure 2 is a diagram of part of the arrangement shown in figure 1 on a larger scale,
illustrating one of the switching sections and one of the locations of the arithmetic logic
units;






Figure 3 shows part of the processor array shown in figure 1 on a smaller scale, 
illustrating the locations of the arithmetic logic units and "vertical" busses extending 
across them; 



Figure 4 is similar to figure 3, but illustrating "horizontal" busses extending across the
locations of the arithmetic logic units; 

Figure 5 shows the interconnections between the busses of figures 2, 3 and 4 at the
location of one of the arithmetic logic units; 


Figure 6A shows in detail the circuitry of one type of programmable switch in the
switching sections, for connecting a pair of 4-bit busses which cross each other;

Figure 6B shows in detail the circuitry of another type of programmable switch in the
switching sections, for connecting a pair of 4-bit busses which meet each other end to
end;



Figure 6C shows in detail the circuitry of another type of programmable switch in the 
switching sections, for connecting carry-bit busses;


Figure 7 shows the circuitry of a series of NOR gates which may be used in the
programmable switches of figures 5 and 6;

Figure 8 shows a modification to the circuitry of figure 7; 

Figure 9 shows a buffer and register which may be used in each switching section;

Figure 10 is a schematic drawing illustrating how enable signals may be distributed to 
the programmable switches in the switching sections; 

Figure 11 shows in more detail the circuitry of the arrangement shown in figure 10;

Figure 12 shows a "corner-turning" RAM;

Figure 13 illustrates vertical access to a corner-turning RAM;

Figure 14 illustrates horizontal access to a corner-turning RAM;

Figure 15 shows the corner-turning RAM of figure 12 in more detail;

Figure 16 illustrates an example of data paths in the corner-turning RAM of figure 15
when used for vertical access;

Figure 17 illustrates an example of data paths in the corner-turning RAM of figure 15
when used for horizontal access; and

Figure 18 shows a modification to the arrangement described with reference to figures
10 and 11 to allow multiple simultaneous writes to the memory cells.

In the following description, the terms "horizontal", "vertical", "North", "South",
"East" and "West" have been used to assist in an understanding of relative directions,
but their use is not intended to imply any restriction on the absolute orientation of the
embodiment of the invention.

The processor array which forms the embodiment of the invention is provided in an
integrated circuit. At one level, the processor array is formed by a rectangular (and
preferably square) array of "tiles" 10, one of which is shown bounded by a thick line in
figure 1. Any appropriate number of tiles may be employed, for example in a 16 x 16,
32 x 32 or 64 x 64 array. Each tile 10 is rectangular (and preferably square) and is
divided into four circuit areas. Two of the circuit areas 12, which are diagonally
opposed in the tile 10, provide the locations for two arithmetic logic units ("ALUs").
The other two circuit areas, which are diagonally opposed in the tile 10, provide the
locations for a pair of switching sections 14.
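
Because each tile holds two diagonally opposed ALU locations and two diagonally opposed switching sections, the circuit areas form a checkerboard across the array. The short Python sketch below illustrates this; the function names are hypothetical, and the choice of which diagonal holds the ALUs is an assumption made only for the example.

```python
def area_kind(row, col):
    """Classify a circuit area on the checkerboard.

    Which diagonal of the tile holds the ALUs is assumed here.
    """
    return "ALU" if (row + col) % 2 == 0 else "switch"


def count_areas(tiles_per_side):
    """Count ALU locations and switching sections in a square array of tiles."""
    side = 2 * tiles_per_side          # each tile is 2 x 2 circuit areas
    counts = {"ALU": 0, "switch": 0}
    for row in range(side):
        for col in range(side):
            counts[area_kind(row, col)] += 1
    return counts


counts_16 = count_areas(16)   # the 16 x 16 tile example from the text
```

For the 16 x 16 example this gives equal numbers of ALU locations and switching sections, two of each per tile.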

Referring to figures 1 and 2, each ALU has a first pair of 4-bit inputs a, which are
directly connected within the ALU, a second pair of 4-bit inputs b, which are also
directly connected within the ALU, and four 4-bit outputs f, which are directly
connected within the ALU. Each ALU also has an independent pair of 1-bit carry inputs
hci, vci, and a pair of 1-bit carry outputs co, which are directly connected within the
ALU. The ALU can perform standard operations on the input signals a, b, hci, vci to
produce the output signals f, co, such as add, subtract, AND, NAND, OR, NOR, XOR,
NXOR and multiplexing and optionally can register the result of the operation. The
instructions to the ALUs may be provided from respective 4-bit memory cells whose
values can be set via the "H-tree" structure described below, or may be provided on the
bus system which will be described below.
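
A minimal behavioural sketch of such a 4-bit ALU is given below in Python. It is illustrative only: the operation names are taken from the list above, but the function signature, the use of a single carry input, the subtraction convention and the omission of the multiplexing and registering functions are all assumptions, since the text does not specify the instruction encoding.

```python
MASK = 0xF  # 4-bit data path


def alu(op, a, b, carry_in=0):
    """Return (f, co) for 4-bit operands a and b."""
    a &= MASK
    b &= MASK
    if op == "add":
        total = a + b + carry_in
        return total & MASK, total >> 4
    if op == "subtract":
        # Assumed convention: a - b is formed as a + (NOT b) + carry_in.
        total = a + (~b & MASK) + carry_in
        return total & MASK, total >> 4
    logic = {
        "and": a & b,
        "nand": ~(a & b),
        "or": a | b,
        "nor": ~(a | b),
        "xor": a ^ b,
        "nxor": ~(a ^ b),
    }
    if op not in logic:
        raise ValueError(f"unsupported operation: {op}")
    return logic[op] & MASK, 0   # logic operations produce no carry here
```

Wider additions would chain the carry output of one ALU into the carry input of the next, which is the role of the carry busses described below.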

At the level shown in figures 1 and 2, each switching section 14 has eight busses
extending across it horizontally, and eight busses extending across it vertically, thus
forming an 8 x 8 rectangular array of 64 crossing points, which have been numbered in
figure 2 with Cartesian co-ordinates. All of the busses have a width of four bits, with
the exception of the carry bus vc at X=4 and the carry bus hc at Y=3, which have a
width of one bit. At many of the crossing points, a 4-gang programmable switch 16 is
provided which can selectively connect the two busses at that crossing point. At some
of the crossing points, a 4-gang programmable switch 18 is provided which can
selectively connect two busses which meet end to end at that crossing point, without any
connection to the bus at right angles thereto. At the crossing point at (4, 3), a
programmable switch 20 (for example as shown in Figure 6C) is provided which can
selectively connect the carry busses vc, hc which cross at right angles at that point.

The horizontal busses in the switching section 14 will now be described.



At Y=0, busses h2s are connectable by programmable switches 16 to the vertical busses
at X=0, 1, 2, 5, 6. The busses h2s have a length of two tiles and are connectable end
to end in every other switching section 14 by a programmable switch 18 at (4, 0).

At Y=1, a bus be extending from an input b of the ALU to the West is connectable by
switches 16 to the vertical busses at X=0, 1, 2, 3. Also, a bus fw extending from an
output f of the ALU to the East is connectable by switches 16 to the vertical busses at
X=5, 6, 7. The ends of the busses be, fw are connectable by a programmable switch 18
at (4, 1).

At Y=2, a bus hregs is connectable by programmable switches 16 to the vertical busses
at X=1, 2, 3, 5, 6, 7.

At Y=3, a bus hco extends from the carry output co of the ALU to the West to a
programmable switch 20 at (4, 3), which can connect the bus hco (a) to a carry bus hci
extending to the carry input hci of the ALU to the East or (b) to a carry bus vci
extending to the carry input vci of the ALU to the South.

At Y=4, a bus hregn is connectable by programmable switches 16 to the vertical busses
at X=0, 1, 2, 3, 5, 6.

At Y=5, busses h1 are connectable to the vertical busses at X=0, 1, 2, 3, 5, 6, 7. The
busses h1 have a length of one tile and are connectable end to end in each switching
section 14 by a programmable switch 18 at (4, 5).



PCT/GB98/00274 



At Y=6, a bus fe extending from an output f of the ALU to the West is connectable by
switches 16 to the vertical busses at X=0, 1, 2, 3. Also, a bus aw extending from an
input a of the ALU to the East is connectable by switches 16 to the vertical busses at
X=5, 6, 7. The ends of the busses fe, aw are connectable by a programmable switch 18
at (4, 6).

At Y=7, busses h2n are connectable by programmable switches 16 to the vertical busses
at X=1, 2, 3, 6, 7. The busses h2n have a length of two tiles and are connectable end
to end in every other switching section 14 by a programmable switch 18 at (4, 7),
staggered with respect to the programmable switches 18 connecting the busses h2s at (4,
0).

The vertical busses in the switching section 14 will now be described.

At X=0, busses v2w are connectable by programmable switches 16 to the horizontal
busses at Y=0, 1, 4, 5, 6. The busses v2w have a length of two tiles and are
connectable end to end in every other switching section 14 by a programmable switch
18 at (0, 3).

At X=1, a bus fn extending from an output f of the ALU to the South is connectable by
programmable switches 16 to the horizontal busses at Y=0, 1, 2. Also, a bus bs
extending from an input b of the ALU to the North is connectable by switches 16 to the
horizontal busses at Y=4, 5, 6, 7. The ends of the busses fn, bs are connectable by a
programmable switch 18 at (1, 3).

At X=2, busses v1 are connectable to the horizontal busses at Y=0, 1, 2, 4, 5, 6, 7.
The busses v1 have a length of one tile and are connectable end to end in each switching
section 14 by a programmable switch 18 at (2, 3).

At X=3, a bus vregw is connectable by programmable switches 16 to the horizontal
busses at Y=1, 2, 4, 5, 6, 7.

At X=4, a bus vco extends from the carry output co of the ALU to the North to the
programmable switch 20 at (4, 3), which can connect the bus vco (a) to the carry bus hci
extending to the carry input hci of the ALU to the East or (b) to the carry bus vci
extending to the carry input vci of the ALU to the South.

At X=5, a bus vrege is connectable by programmable switches 16 to the horizontal
busses at Y=0, 1, 2, 4, 5, 6.

At X=6, a bus an extending from an input a of the ALU to the South is connectable by
switches 16 to the horizontal busses at Y=0, 1, 2. Also, a bus fs extending from an
output f of the ALU to the North is connectable by programmable switches 16 to the
horizontal busses at Y=4, 5, 6, 7. The ends of the busses an, fs are connectable by a
programmable switch 18 at (6, 3).



At X=7, busses v2e are connectable by programmable switches 16 to the horizontal
busses at Y=1, 2, 5, 6, 7. The busses v2e have a length of two tiles and are connectable
end to end in every other switching section 14 by a programmable switch 18 at (7, 3)
staggered with respect to the programmable switches 18 connecting the busses v2w at
(0, 3).
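
The crossing points listed above can be collected into a small table and cross-checked, since every switch 16 appears once in the description of a horizontal bus and once in the description of a vertical bus. The Python sketch below does this. The bus names and coordinates are as read from the description (the names fn, fs, h1 and v1 are readings of partly illegible text), the data layout is purely illustrative, and the busses h1 and v1 are assumed to be connected by switches 16 like the others.

```python
# Horizontal busses: name -> (Y, list of X positions with a switch 16)
HORIZONTAL = {
    "h2s":   (0, [0, 1, 2, 5, 6]),
    "be":    (1, [0, 1, 2, 3]),
    "fw":    (1, [5, 6, 7]),
    "hregs": (2, [1, 2, 3, 5, 6, 7]),
    "hregn": (4, [0, 1, 2, 3, 5, 6]),
    "h1":    (5, [0, 1, 2, 3, 5, 6, 7]),
    "fe":    (6, [0, 1, 2, 3]),
    "aw":    (6, [5, 6, 7]),
    "h2n":   (7, [1, 2, 3, 6, 7]),
}

# Vertical busses: name -> (X, list of Y positions with a switch 16)
VERTICAL = {
    "v2w":   (0, [0, 1, 4, 5, 6]),
    "fn":    (1, [0, 1, 2]),
    "bs":    (1, [4, 5, 6, 7]),
    "v1":    (2, [0, 1, 2, 4, 5, 6, 7]),
    "vregw": (3, [1, 2, 4, 5, 6, 7]),
    "vrege": (5, [0, 1, 2, 4, 5, 6]),
    "an":    (6, [0, 1, 2]),
    "fs":    (6, [4, 5, 6, 7]),
    "v2e":   (7, [1, 2, 5, 6, 7]),
}

# Build the set of (X, Y) switch positions from each description.
from_rows = {(x, y) for y, xs in HORIZONTAL.values() for x in xs}
from_cols = {(x, y) for x, ys in VERTICAL.values() for y in ys}
```

The two sets agree, which gives some confidence in the coordinates as read; the column X=4 and the row Y=3 carry only the end-to-end switches 18 and the carry switch 20.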

As shown in figure 2, the busses bs, vco, fs are connected to input b, output co and
output f, respectively, of the ALU to the North of the switching section 14. Also, the
busses fe, hco, be are connected to the output f, output co and input b of the ALU,
respectively, to the West of the switching section 14. Furthermore, the busses aw, hci,
fw are connected to the input a, input ci and output f, respectively, of the ALU to the
East of the switching section 14. Moreover, the busses fn, vci, an are connected to the
output f, input ci and input a, respectively, of the ALU to the South of the switching
section 14.

In addition to these connections, the busses vregw, vrege are connected via respective
programmable switches 18 to 4-bit connection points vtsw, vtse, respectively, (shown
by crosses in Figure 2) in the area 12 of the ALU to the North of the switching section
14. Also, the busses hregs, hregn are connected via respective programmable switches
18 to 4-bit connection points htse, htne, respectively, in the area 12 of the ALU to the
West of the switching section 14. Furthermore, the busses hregs, hregn are connected
via respective programmable switches 18 to 4-bit connection points htsw, htnw,
respectively, in the area 12 of the ALU to the East of the switching section 14.
Moreover, the busses vregw, vrege are connected via respective programmable switches
18 to 4-bit connection points vtnw, vtne, respectively, in the area 12 of the ALU to the
South of the switching section 14. These connection points vtnw, vtne, htne, htse, vtse,
vtsw, htsw, htnw will be described below in further detail with reference to figures 3 to
5.



Also, as shown in figure 2, the busses hregn, vrege, hregs, vregw have respective 4-bit
connection points 22 (shown by small squares in figure 2) which will be described below
in further detail with reference to figure 9.



Figure 3 shows one level of interconnections between the locations of the arithmetic
logic units, which are illustrated by squares with rounded corners. A group of four 4-bit
busses v8, v4w, v4e, v16 extend vertically across each column of ALU locations 12.
The leftmost bus v8 in each group is in segments, each having a length generally of eight
tiles. The leftmost but one bus v4w in each group is in segments, each having a length
generally of four tiles. The rightmost but one bus v4e in each group is in segments,
again each having a length generally of four tiles, but offset by two tiles from the
leftmost but one bus v4w. The rightmost bus v16 in each group is in segments, each
having a length generally of sixteen tiles. At the top edge of the array, which is at the
top of figure 4, and at the bottom edge the lengths of the segments may be slightly
greater than or shorter than specified above.


Referring to figures 3 and 5, where each group of four busses v8, v4w, v4e, v16 crosses
each ALU location 12, four 4-bit tap connections are made at the connection points
htnw, htsw, htse, htne. The ends of the bus segments take priority in being so connected
over a connection to a bus segment which crosses the ALU location.

Similarly, as shown in figures 4 and 5, a group of four 4-bit busses h8, h4n, h4s, h16
extend horizontally across each row of ALU locations 12. The uppermost bus h8 in each
group is in segments, each having a length generally of eight tiles. The uppermost but
one bus h4n in each group is in segments, each having a length generally of four tiles.
The lowermost but one bus h4s in each group is in segments, again each having a length
generally of four tiles, but offset by two tiles from the uppermost but one bus h4n. The
lowermost bus h16 in each group is in segments, each having a length generally of
sixteen tiles. At the left hand edge of the array, which is at the left of figure 4, and at
the right hand edge the lengths of the segments may be slightly greater than or shorter
than specified above. Where each group of busses h8, h4n, h4s, h16 crosses each ALU
location 12, a further four 4-bit tap connections are made at the connection points vtnw,
vtsw, vtse, vtne. The ends of the bus segments take priority in being so connected over
a connection to a bus segment which crosses the ALU location.

As shown in figure 5, the connection points htnw, htsw, htne, htse are connected via
programmable switches to the busses hregn, hregs of the switching sections to the West
and the East of the ALU location. Also, the connection points vtnw, vtne, vtsw, vtse are
connected via programmable switches to the busses vregw, vrege of the switching
sections to the North and the South of the ALU location.

The programmable connections 16 between pairs of 4-bit busses which cross at right
angles will now be described with reference to figure 6A. The conductors of the
horizontal busses are denoted as x0, x1, x2, x3, and the conductors of the vertical busses
are denoted as y0, y1, y2, y3. Between each pair of conductors of the same bit
significance, a respective transistor 160, 161, 162, 163 is provided. The gates of the
transistors 160, 161, 162, 163 are connected in common to the output of a NOR gate
16g, which receives as its two inputs an inverted ENABLE signal from a single bit
memory cell, which may be shared by a group of the switches, and the inverted content
of a single bit memory cell 24. Accordingly, only when the ENABLE signal is high and
the content of the memory cell 24 is high, the conductors x0, x1, x2, x3 are connected
by the transistors 160, 161, 162, 163, respectively, to the conductors y0, y1, y2, y3,
respectively.
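
The gating just described can be checked with a few lines of Python: a NOR gate fed with the two inverted signals behaves as an AND of the ENABLE signal and the memory cell content. The function names are hypothetical, and the model treats the pass transistors as one-directional for simplicity, which is an assumption made only for this sketch.

```python
def nor(a, b):
    """Two-input NOR on 0/1 values."""
    return int(not (a or b))


def switch_closed(enable, cell):
    """Gate drive of the pass transistors: NOR of the two inverted inputs."""
    return nor(1 - enable, 1 - cell)


def connect(x_bits, enable, cell):
    """Return the values passed to y0..y3; None marks an open (undriven) bit."""
    closed = switch_closed(enable, cell)
    return [bit if closed else None for bit in x_bits]


truth_table = {(e, c): switch_closed(e, c) for e in (0, 1) for c in (0, 1)}
```

Only the combination ENABLE high and cell content high closes the switch, in agreement with the text.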

The programmable connections 18 between pairs of 4-bit busses which meet each other
end to end in line will now be described with reference to figure 6B. The conductors of
one bus are denoted as x10, x11, x12, x13, and the conductors of the other bus are
denoted as x20, x21, x22, x23. Between each pair of conductors of the same bit
significance, a respective transistor 180, 181, 182, 183 is provided. The gates of the
transistors 180, 181, 182, 183 are connected in common to the output of a NOR gate
18g, which receives as its two inputs an inverted ENABLE signal from a single bit
memory cell, which may be shared by a group of the switches, and the inverted content
of a single bit memory cell 24. Accordingly, only when the ENABLE signal is high and
the content of the memory cell 24 is high, the conductors x10, x11, x12, x13 are
connected by the transistors 180, 181, 182, 183, respectively, to the conductors x20,
x21, x22, x23, respectively.

The programmable connections 20 between the carry conductors hco, vco, hci, vci will
now be described with reference to figure 6C. The horizontal carry output conductor hco
is connected to the horizontal carry input conductor hci and the vertical carry input
conductor vci via transistors 20hh, 20hv, respectively. Furthermore, the vertical carry
output conductor vco is connected to the vertical carry input conductor vci and the
horizontal carry input conductor hci via transistors 20vv, 20vh, respectively. The gates
of the transistors 20hh, 20vv are connected in common to the output of an inverter 20i,
and the gates of the transistors 20hv, 20vh and the input to the inverter 20i are connected
to the output of a NOR gate 20g. The NOR gate 20g receives as its two inputs an
inverted ENABLE signal from a single bit memory cell, which may be shared by a
group of the switches, and the inverted content of a single bit memory cell 24.
Accordingly, when the ENABLE signal is high, the conductors hco, vco are connected
to the conductors hci, vci, respectively, or to the conductors vci, hci, respectively, in
dependence upon the content of the memory cell 24.
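
The routing performed by the carry switch can be traced with the following Python sketch, which follows the gate connections as described: the NOR gate output drives the crossing transistors 20hv, 20vh directly and the straight-through transistors 20hh, 20vv through the inverter. The function name and the dictionary representation are hypothetical.

```python
def carry_switch(enable, cell):
    """Return, for each carry input, the carry output that drives it."""
    g = int(not ((1 - enable) or (1 - cell)))   # NOR gate 20g on inverted inputs
    straight = 1 - g                             # inverter 20i drives 20hh, 20vv
    if straight:
        # 20hh and 20vv conduct: hco -> hci, vco -> vci
        return {"hci": "hco", "vci": "vco"}
    # 20hv and 20vh conduct: hco -> vci, vco -> hci
    return {"hci": "vco", "vci": "hco"}


straight_through = carry_switch(enable=1, cell=0)
crossed = carry_switch(enable=1, cell=1)
```

With this arrangement of gates, the straight-through transistors also conduct whenever the NOR output is low, which includes the case where ENABLE is low.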

It will be noted that each of the switchable connections 16, 18, 20 described with
reference to figures 6A to 6C includes a NOR gate 16g, 18g, 20g. As shown in figure
7, a NOR gate 16g is typically formed by four transistors 16g1, 16g2, 16g3, 16g4, two
16g1, 16g3 of which are responsive to the inverted ENABLE signal, and two 16g2,
16g4 of which are responsive to the inverted content of the memory cell 24. In the
embodiment of the invention, it is desirable that a group of the switchable connections 16,
18, 20 may be disabled in common, without any need for only part of such a group to
be disabled. Such a group might consist of all of the switchable connections in one
switching section 14, all of the switchable connections in the two switching sections 14
in a particular tile, or all of the switchable connections in a larger area of the array. In
this case, the transistor 16g1 may be made common to all of the switchable connections
16, 18, 20 in the group, as shown in figure 8. This enables a saving of 25%, less one
transistor, in the number of transistors required for the gates, but does require a further
conductor linking the gates, as shown in figure 8.
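
The saving can be worked through numerically. With four transistors per NOR gate, a group of N gates needs 4N transistors; sharing the one transistor 16g1 across the group leaves three per gate plus the single shared device, that is 3N + 1. The Python sketch below, with a hypothetical function name and an arbitrary group size of 64, shows the arithmetic.

```python
def transistor_counts(n_gates):
    """Return (separate, shared, saved) transistor counts for a group of gates."""
    separate = 4 * n_gates      # four transistors per NOR gate (figure 7)
    shared = 3 * n_gates + 1    # one transistor common to the group (figure 8)
    return separate, shared, separate - shared


separate, shared, saved = transistor_counts(64)
```

The saving is N - 1 transistors, which is one quarter of 4N less one, and so approaches 25% as the group grows.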

As mentioned above with reference to figures 1 and 2, at each switching section 14, the
busses hregn, hregs, vregw, vrege are connected by respective 4-bit connections 22 to
a register or buffer circuit, and this circuit will now be described in more detail with
reference to figure 9. The four connections 22 are each connected to respective inputs
of a multiplexer 26. The multiplexer 26 selects one of the inputs as an output, which is
supplied to a register or buffer 28. The output of the register or buffer 28 is supplied to
four tri-state buffers 30s, 30w, 30n, 30e, which are connected back to the connections
22 to the busses hregs, vregw, hregn, vrege, respectively. In the case where a buffer 28
is used, the 4-bit signal on a selected one of the busses hregs, vregw, hregn, vrege is
amplified and supplied to another selected one of the busses hregs, vregw, hregn, vrege.
In the case where a register 28 is used, the 4-bit signal on a selected one of the busses
hregs, vregw, hregn, vrege is amplified and supplied to any selected one of the busses
hregs, vregw, hregn, vrege after the next active clock edge.
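
The difference between the buffer and register cases can be illustrated with a small behavioural model in Python. The class and method names are hypothetical, and the sketch assumes that exactly one tri-state buffer is enabled at a time and that the register initially holds zero; neither detail is stated in the text.

```python
class RegisterBufferCircuit:
    """Toy model of multiplexer 26, register or buffer 28 and the tri-state buffers."""

    BUSES = ("hregs", "vregw", "hregn", "vrege")

    def __init__(self, registered):
        self.registered = registered   # True: element 28 is a register
        self.stored = 0                # register contents (assumed to start at 0)

    def clock(self, bus_values, source):
        """Active clock edge: a register captures the multiplexer output."""
        if self.registered:
            self.stored = bus_values[source] & 0xF

    def drive(self, bus_values, source, dest):
        """Return the value driven onto each bus; None means high impedance."""
        value = self.stored if self.registered else bus_values[source] & 0xF
        return {bus: (value if bus == dest else None) for bus in self.BUSES}


buses = {"hregs": 0x9, "vregw": 0x0, "hregn": 0x0, "vrege": 0x0}
buffer_out = RegisterBufferCircuit(registered=False).drive(buses, "hregs", "vrege")

reg = RegisterBufferCircuit(registered=True)
before_edge = reg.drive(buses, "hregs", "vrege")
reg.clock(buses, "hregs")
after_edge = reg.drive(buses, "hregs", "vrege")
```

In buffer mode the selected signal appears on the destination bus at once, whereas in register mode it appears only after the clock edge, which is how a pipeline stage is introduced.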

It will be appreciated that the arrangement described above provides great flexibility in
the routing of signals around and across the array. With appropriate setting of the
switches 16, 18, 20 using the memory cells 24 and with appropriate setting of the
multiplexers 26 and registers or buffers 28, signals can be sent over large distances,
primarily using the busses v16, h16, v8, h8, v4e, v4w, h4n, h4s from the edge of the
array to a particular ALU, between ALUs, and from a particular ALU to the edge of the
array. These busses can be joined together in line, or at right angles, by the switching
sections 14, with amplification by the registers or buffers 28 in order to reduce
propagation delays, and with pipeline stages introduced by the registers 28. Also, these
busses can be tapped part way along their lengths, so that the siting of the ALUs to
perform a particular processing operation is not completely dictated by the lengths of the
busses, and so that signals can be distributed to more than one ALU. Furthermore, the
shorter length busses described with reference to figures 1 and 2 can be used to route
signals between the switching sections 14 and the ALUs, and to send signals primarily
over short distances, for example from one ALU to an adjacent ALU in the same row
or column, or diagonally adjacent, even though the busses extend horizontally or
vertically. Again, the registers or buffers 28 can be used to amplify the signals or
introduce programmable delays into them.

In the arrangement described above, the memory cells 24 are distributed across the array
to the same extent as the switching sections 14 and the ALU locations 12. Each memory
cell 24 is disposed adjacent the switch or switches, multiplexer, register or buffer which
it controls. This enables a high circuit density to be achieved.

A description will now be made of the manner in which data is written to or read from
the memory cells 24, the way in which the ENABLE signals for the programmable
switches 16, 18, 20 are written to their memory cells, the way in which instructions, and
possibly constants, are distributed to the ALUs, and the way in which other control
signals, such as a clock signal, are transmitted across the array. For all of these
functions, an "H-tree" structure may be employed, as shown in figure 10. Referring to
Figures 10 and 11, in order to distribute an ENABLE signal to any of 64 locations in
the example shown, the ENABLE signal 30a and a 6-bit address 32a for it are supplied
to a decoder 34a. The decoder 34a determines which of the four branches from it leads
to the address and supplies an ENABLE signal 30b to a further decoder 34b in that
branch, together with a 4-bit address 32b to the decoders 34b in all four branches. The
decoder 34b receiving the ENABLE signal 30b determines which of the four branches
from it leads to the required address and supplies an ENABLE signal 30c to a further
decoder 34c in that branch, together with a 2-bit address 32c to the decoders 34c in all
four branches. The decoder 34c receiving the ENABLE signal 30c then supplies the
ENABLE signal 30d to the required address where it can be stored in a single bit
memory cell. An advantage of the H-tree structure is that the lengths of the signal paths
to all of the destinations are approximately equal, which is particularly advantageous in
the case of the clock signal.
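
The way the address is consumed level by level can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that each four-way decoder uses the two most significant remaining address bits to choose its branch and passes the rest down; the text does not say which bits are used at which level.

```python
def route_enable(address, levels=3, bits_per_level=2):
    """Trace an ENABLE signal down the decoder tree.

    Returns one (branch, remaining_bits, remaining_address) tuple per level,
    top level first.
    """
    total_bits = levels * bits_per_level
    if not 0 <= address < (1 << total_bits):
        raise ValueError("address out of range")
    steps = []
    remaining_bits = total_bits
    remaining = address
    for _ in range(levels):
        remaining_bits -= bits_per_level
        branch = remaining >> remaining_bits        # bits used by this decoder
        remaining &= (1 << remaining_bits) - 1      # bits passed to the next level
        steps.append((branch, remaining_bits, remaining))
    return steps


steps = route_enable(0b100111)   # location 39 of the 64
```

For the 6-bit address 100111, the first decoder takes branch 2 and passes on four bits, the second takes branch 1 and passes on two bits, and the third takes branch 3 to reach the destination.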

A great advantage of the arrangement described above is that groups of the memory cells
24 in for example one switching section 14, or in the two switching sections in one tile,
or in the switching sections in a sub-array of the tiles may be disabled en bloc by the
inverted ENABLE signals so that the contents of those memory cells do not affect the
associated switches. It is then possible for those memory cells 24 to be used as "user"
memory by an application, rather than being used for configuring the wiring of the
array.

The embodiment of the invention has been described merely by way of example, and
many modifications and developments may be made in keeping with the present
invention. For example, the embodiment employs ALUs as the processing units, but
other processing units may additionally or alternatively be used, for example look-up
tables, programmable logic arrays and/or self-contained CPUs which are able to fetch
their own instructions.

Furthermore, the embodiment has been described as if the whole array is covered by
ALUs and switching sections. However, other types of section may be included in the
array. For example, a sub-array might be composed of a 4 x 4 arrangement of tiles of
ALUs and switching sections as described above, and the array might be composed of
such sub-arrays and memory in a 4 x 4 array, or such sub-arrays and RISC CPUs in a
4 x 4 array.

In the embodiment described above, each ALU location is square, and each switching
section is square and of the same size as the ALU locations, but it should be noted that
the controllable switches 18 in the register busses vregw, vrege, hregn, hregs encroach
into the square outline of the ALU locations. The ALU locations need not be of the same
size as the switching sections, and in particular may be smaller, thus permitting one or
more busses to pass horizontally or vertically directly from one switching section 14 to
a diagonally adjacent switching section 14, for example running between the busses h2s,
h2n or between the busses v2e, v2w.

In the embodiment described above, each ALU has two independent carry inputs vci,
hci and a connected pair of carry outputs co. If required, the ALUs may be arranged to
deal with two types of carry: a fast carry between adjacent ALUs which may be of
particular use for multi-bit adding operations; and a slow carry which can be routed
more flexibly and may be of particular use for digital serial arithmetic. The fast carry
might be arranged in a similar manner to that described above with reference to the
drawings, whereas the slow carry might employ programmable switches in the switching
sections 14 between the carry conductor and particular bits of the 4-bit busses.

In the embodiment described above, particular bit widths, sizes of switching section and
sizes of array have been mentioned, but it should be noted that all of these values may
be changed as appropriate. Also, the programmable switches 16, 18, 20 have been
described as being disposed at particular locations in each switching section 14, but other
locations may be used as required and desired.

In the embodiment described above, the array is two-dimensional, but the principles of
the invention are also applicable to three-dimensional arrays, for example by providing
a stack of the arrays described above, with the switching sections in adjacent layers
staggered with respect to each other. The stack might include just two layers, but
preferably at least three layers, and the number of layers is preferably a power of two.

In the embodiment described above, the memory cells 24 can be isolated by the gates
16g, 18g, 20g from the switches which they control so that the memory cells can be used
for other purposes, that is put in the "user plane". The ENABLE signal memory cells,
however, cannot be transferred to the user plane. In an alternative embodiment, the
switches in a particular switching section 14 may be disconnectable from the remainder
of the array by further switches in the busses at the boundary of that switching section
14, with the further switches being controlled by a further memory cell which cannot be
transferred to the user plane.

A further embodiment of the present invention will now be described with reference to
figures 12 to 17. This embodiment is applied to a corner-turning RAM 40, the principle
of operation of which is shown in figure 12. As shown, the RAM 40 comprises a 4 x 4
array of memory cells 42, each of which can store four bits (i.e. a nibble) of data. The
RAM 40 has two ports 44, 46, one 44 of which operates with data in nibble-parallel
format, reading or writing rows of data, each constituting a word, into the RAM 40. The
other port operates with data in nibble-serial format, reading or writing corresponding
columns of data into the RAM 40, each column containing corresponding nibbles from
multiple words of data.
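
Purely by way of illustration, the corner-turning principle of figure 12 can be modelled
in a few lines of Python. The class name CornerTurningRAM, the helper to_nibbles and the
least-significant-nibble-first ordering are assumptions made for this sketch; they are not
taken from the figures.

```python
class CornerTurningRAM:
    """Toy model of a rows x cols grid of 4-bit cells (hypothetical name)."""

    def __init__(self, rows=4, cols=4):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    def write_row(self, r, nibbles):
        # Nibble-parallel port: one whole word (a row of nibbles) per access.
        if len(nibbles) != self.cols:
            raise ValueError("a row must contain one nibble per column")
        self.cells[r] = [n & 0xF for n in nibbles]

    def read_column(self, c):
        # Nibble-serial port: the same nibble position taken from every word.
        return [self.cells[r][c] for r in range(self.rows)]


def to_nibbles(word, count=4):
    # Assumption: nibbles are ordered least significant first.
    return [(word >> (4 * i)) & 0xF for i in range(count)]


ram = CornerTurningRAM()
for row, word in enumerate([0x1234, 0xABCD, 0x0F0F, 0x8001]):
    ram.write_row(row, to_nibbles(word))
low_nibbles = ram.read_column(0)  # lowest nibble of each of the four words
```

Writing four words by rows and then reading one column yields the corresponding nibble of
every word, which is the format conversion performed between the two ports.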

The corner-turning RAM may be used in combination with the first embodiment of the
invention. For example, in order to perform operations with 16-bit precision on 16-bit
operands, with the first embodiment four of the ALUs may be used. However, in order
to conserve ALU use, a single ALU may be used handling the operands in nibble-serial
format. A corner-turning RAM 40 may therefore be used firstly to convert the operands
from nibble-parallel format to nibble-serial format and then to convert the result from
nibble-serial format back to nibble-parallel format. The corner-turning RAM 40 may,
of course, also be used independently of the first embodiment of the invention.
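
As a hedged illustration of why nibble-serial operands conserve ALUs, the following Python
sketch adds two 16-bit operands using a single 4-bit adder over four steps. The function
names alu4_add and add16_nibble_serial are hypothetical, and the least-significant-nibble-
first order is an assumption of the sketch rather than a feature taken from the embodiment.

```python
def alu4_add(a, b, carry_in):
    """One 4-bit addition step: returns (sum nibble, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def add16_nibble_serial(x, y):
    """Add two 16-bit operands with one 4-bit ALU, one nibble per step."""
    carry, result = 0, 0
    for i in range(4):  # assumption: least significant nibble is processed first
        nibble, carry = alu4_add(x >> (4 * i), y >> (4 * i), carry)
        result |= nibble << (4 * i)
    return result, carry
```

The same ALU is reused in each step, with the carry out of one step becoming the carry in
of the next, at the cost of four steps instead of one.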
Although figure 12 shows the corner-turning RAM as a square array, this is not
necessary. The width of the array is determined by the width of the nibble-parallel port,
and the height is determined by the number of parallel words to be extracted from the
nibble-serial port. These are two independent design parameters. Furthermore, in
principle, neither the width nor the height need be a power of two nibbles, but
power-of-two dimensions can conveniently be chosen to simplify the control of the RAM 40.
Because the arrangement described below works best with power-of-two RAM
dimensions, this will be described, but the vertical and horizontal dimensions may be
different.

For correct operation, the whole RAM 40 needs to be written in one orientation before
being read in the other orientation. For example, if a column is read before all four rows
have been written, the nibbles from the unwritten rows will contain invalid data. Also,
the whole RAM contents need to be read in the reading orientation before new data is
written into the same address space in the RAM 40 from the writing side. For these
reasons, the corner-turning RAM may be used in pairs, to allow double-buffering.
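
One possible way of pairing two such RAMs is sketched below in Python, by way of example
only. The class DoubleBufferedCornerTurner and its bank-swapping policy are assumptions
made for illustration; the sketch does not enforce that a completed bank is fully read
before it is refilled, which a real controller would have to guarantee.

```python
class DoubleBufferedCornerTurner:
    """Two size x size banks: one is filled by rows while the other is read by columns."""

    def __init__(self, size=4):
        self.size = size
        self.banks = [[[0] * size for _ in range(size)] for _ in range(2)]
        self.fill_bank = 0      # index of the bank currently being written
        self.rows_written = 0   # rows written so far into the filling bank

    def write_row(self, nibbles):
        """Write the next row; return True when a bank has just been completed."""
        self.banks[self.fill_bank][self.rows_written] = list(nibbles)
        self.rows_written += 1
        if self.rows_written < self.size:
            return False
        # The bank is full: swap roles so that it becomes the readable bank.
        self.fill_bank ^= 1
        self.rows_written = 0
        return True

    def read_column(self, c):
        """Read column c of the most recently completed bank."""
        bank = self.banks[self.fill_bank ^ 1]
        return [bank[r][c] for r in range(self.size)]
```

While new rows are being written into one bank, columns read from the other bank remain
valid, so writing and reading can overlap.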

In this further embodiment of the invention, the corner-turning RAM 40 is implemented
in a hierarchical fashion, with each level of hierarchy including a factor of two increase
in memory size in either or both directions. Figures 13 and 14 show a single level of
hierarchy that includes a factor of two size increase in both directions, making a factor
of four overall.

When accessing the memory vertically, as shown in figure 13, one address bit is used
to determine which row of the two lower level rows of blocks 42 is accessed, either low
address or high address. The two accessed blocks are accessed in parallel in this case
using the data bus V0 for the four bits of the low nibble and the data bus V1 for the four
bits of the high nibble. By contrast, figure 14 illustrates access to the hierarchical
corner-turning RAM when made horizontally, rather than vertically. In this case, one address
bit is used to determine which column of the two lower level columns of blocks 42 is
accessed, the low address or high address. The two accessed blocks are accessed in
parallel using the data bus H0 for the four bits of the low nibble and the data bus H1
for the four bits of the high nibble. It will be noted from figures 13 and 14 that, with
either orientation of access, the top-left block 42 has the low address and the low nibble,
and the bottom-right block 42 has the high address and the high nibble. However, the
top-right block 42 changes between low-address/high-nibble during vertical access and
high-address/low-nibble during horizontal access, and the bottom-left block 42 changes
between high-address/low-nibble for vertical access and low-address/high-nibble for
horizontal access. The address decoding and data multiplexing at this level of the
hierarchical RAM 40 needs to be controllable according to the access orientation to
allow the RAM to be used as a corner-turning RAM.
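
The change of role of the off-diagonal blocks can be made concrete with the following
Python sketch, offered as an illustration only. The function name block_roles, the
(row, column) coordinates with (0, 0) at the top left, and the encoding of the low and high
nibble as 0 and 1 are assumptions of the sketch.

```python
def block_roles(orientation, address_bit):
    """Return {(row, col): nibble} for the two blocks accessed at one hierarchy level.

    nibble is 0 for the low nibble (bus V0 or H0) and 1 for the high nibble
    (bus V1 or H1). Coordinates are an assumed convention: (0, 0) is top left.
    """
    roles = {}
    for row in (0, 1):
        for col in (0, 1):
            if orientation == "vertical":
                address, nibble = row, col   # address picks the row of blocks
            elif orientation == "horizontal":
                address, nibble = col, row   # address picks the column of blocks
            else:
                raise ValueError("orientation must be 'vertical' or 'horizontal'")
            if address == address_bit:
                roles[(row, col)] = nibble
    return roles
```

For example, the top-right block (0, 1) is returned by block_roles("vertical", 0) as the
high nibble and by block_roles("horizontal", 1) as the low nibble, in line with the
description of figures 13 and 14.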

In a larger hierarchical corner-turning RAM, the address decoding and data multiplexing
at each level of the hierarchy is controlled according to the access orientation, as
described above.

Figure 15 illustrates a corner-turning RAM 40 having two levels of hierarchy in each
direction, thus providing a 16-fold increase in memory size, and which employs an
"H-tree" structure for both the address paths and the data paths. In figure 15, the 4-bit
memory cells 42 have been marked with 4-bit labels 0000 to 1111; the columns of the
memory cells 42 have been marked with two-bit addresses 00 to 11; and the rows of the
memory cells 42 have been marked with two-bit addresses 00 to 11. The four 2 x 2
groups of the memory cells 42 each have a respective lower-level address decoder and
data multiplexer 52 labelled 00 to 11, and there is a central higher-level address decoder
and data multiplexer 54.

The addressing operation of the RAM 40 of figure 15 will now be described. The
higher-level decoder/multiplexer 54 receives two signals A0, A1 giving the address of
the row or the column of the memory cells 42 to be accessed, together with a signal O
indicating whether vertical (logic level "0") or horizontal (logic level "1") access is
required. The signal A0 is passed directly to the lower-level decoder/multiplexers 52.
The signal A1 is used by the higher-level decoder 54. The signal O is both used by the
higher-level decoder/multiplexer 54 and is also passed on to the lower-level
decoder/multiplexers 52. The higher-level decoder/multiplexer 54 produces four select
signals S00 to S11, which are supplied to the respective lower-level decoder/multiplexers
52(00) to 52(11) and are generated according to the following:

    S00 = not(A1)
    S01 = (not(A1) and not(O)) or (A1 and O)
    S10 = (not(A1) and O) or (A1 and not(O))
    S11 = A1

The lower-level decoder/multiplexers 52 each produce four address signals A00 to A11
which are supplied to the respective memory cells 42 serviced by that
decoder/multiplexer 52 and are generated according to the following:

    A00 = Sxx and not(A0)
    A01 = Sxx and ((not(A0) and not(O)) or (A0 and O))
    A10 = Sxx and ((not(A0) and O) or (A0 and not(O)))
    A11 = Sxx and A0

where Sxx denotes the respective select signal S00 to S11 received by that lower-level
decoder/multiplexer 52.

Accordingly, it will be appreciated that, with vertical access (O=0), a row of the
memory cells 42 is addressed as designated by the address signals A0, A1. By contrast,
with horizontal access (O=1), a column of the memory cells 42 is addressed as
designated by the address signals A0, A1.
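
The two-level decoding relations given above can be checked with the following Python
sketch, which is illustrative only. It assumes that the first digit of each decoder label
and of each cell label is a row bit and the second digit a column bit, a convention
inferred from the relations themselves rather than stated in the description; the function
name decode is likewise hypothetical.

```python
def decode(a1, a0, o):
    """Return the set of (row, col) cells enabled in the 4 x 4 array.

    a1, a0: address bits; o: orientation (0 = vertical, 1 = horizontal).
    Assumption: labels read as (row bit, column bit) at both levels.
    """
    def inv(x):
        return 1 - x

    # Higher-level decoder 54: select signals S00 to S11.
    select = {
        "00": inv(a1),
        "01": (inv(a1) & inv(o)) | (a1 & o),
        "10": (inv(a1) & o) | (a1 & inv(o)),
        "11": a1,
    }
    enabled = set()
    for block, sxx in select.items():
        # Lower-level decoder 52(block): address signals A00 to A11.
        address = {
            "00": sxx & inv(a0),
            "01": sxx & ((inv(a0) & inv(o)) | (a0 & o)),
            "10": sxx & ((inv(a0) & o) | (a0 & inv(o))),
            "11": sxx & a0,
        }
        for cell, signal in address.items():
            if signal:
                row = 2 * int(block[0]) + int(cell[0])
                col = 2 * int(block[1]) + int(cell[1])
                enabled.add((row, col))
    return enabled
```

Under these assumptions, decode(1, 0, 0) enables the four cells of row 10 and
decode(0, 1, 1) enables the four cells of column 01, so that the same address bits select
a row or a column depending only on the orientation signal O.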

The data paths of the RAM 40 of figure 15 will now be described. The input/output data
path for vertical access is shown as four 4-bit busses V00 to V11 connecting to the
higher-level decoder/multiplexer 54, and the input/output data path for horizontal access
is shown as four 4-bit busses H00 to H11 also connected to the higher-level
decoder/multiplexer 54. Each of the lower-level decoder/multiplexers 52(xx) is
connected to the higher-level decoder/multiplexer 54 by two 4-bit data busses DxxA,
DxxB. Also, each of the lower-level decoder/multiplexers 52 is connected to each of its
memory cells 42 by respective 4-bit data busses d00 to d11. The logical relations by
which the lower-level decoder/multiplexers 52 connect the data busses d00 to d11 to the
data busses DxxA, DxxB are similar to those employed by the decoder/multiplexers 52
for addressing the memory cells 42. Furthermore, the logical relations by which the
higher-level decoder/multiplexer 54 connects the data busses DxxA, DxxB to the vertical
and horizontal data busses V00 to V11, H00 to H11 are similar to those employed by
the decoder/multiplexer 54 for selecting the lower-level decoder/multiplexers 52.

For example, as shown in figure 16, when the address inputs are O=0, A1=1 and
A0=0, denoting that the memory cells 42 in row 10 should be connected to the vertical
input/output busses V00-V11, the lower-level decoder/multiplexer 52(10) connects its
memory cell data busses d00, d01 to the data busses D10A, D10B, respectively, and the
higher-level decoder/multiplexer 54 connects the data busses D10A, D10B to the vertical
input/output busses V00, V01, respectively. Also, the lower-level decoder/multiplexer
52(11) connects its memory cell data busses d00, d01 to the data busses D11B, D11A,
respectively, and the higher-level decoder/multiplexer 54 connects the data busses D11B,
D11A to the vertical input/output busses V10, V11, respectively. Accordingly, the
memory cells 42 in row 10 are connected to the vertical input/output data busses
V00-V11 in the correct order.



Figure 17 shows another example where the address inputs are O=1, A1=0 and A0=1,
denoting that the memory cells 42 in column 01 should be connected to the horizontal
input/output busses H00-H11. In this case, the lower-level decoder/multiplexer 52(00)
connects its memory cell data busses d01, d11 to the data busses D00A, D00B,
respectively, and the higher-level decoder/multiplexer 54 connects the data busses
D00A, D00B to the horizontal input/output busses H00, H01, respectively.
Furthermore, the lower-level decoder/multiplexer 52(10) connects its memory cell data
busses d01, d11 to the data busses D10B, D10A, respectively, and the higher-level
decoder/multiplexer 54 connects the data busses D10B, D10A to the horizontal
input/output busses H10, H11, respectively. Accordingly, the memory cells 42 in
column 01 are connected to the horizontal input/output data busses H00-H11 in the
correct order.

It will be appreciated that many modifications and developments may be made to the
second embodiment of the invention. For example, as mentioned above, the array of
memory cells need not be square, and neither the height nor the width of the array need
be a power of two.

Also, although the H-tree structure has been employed both for the addressing paths and
the data paths, it may be employed for only one of these.

Furthermore, although separate horizontal and vertical busses H00-H11, V00-V11 have
been described above, these busses may alternatively share the same conductors.

Figure 18 shows a modification to the arrangement of figures 10 and 11 for writing data
to a hierarchical RAM for configuring a field programmable gate array, field
programmable processor array or the like. Configuring large arrays of this type requires
a large amount of data, and loading the data can occupy the memory and bus bandwidths
for extended periods of time. A technique which reduces the amount of data to be loaded
into the array can reduce the memory storage requirements, the bus bandwidth
requirements and the delay in loading a new configuration.

An array configuration for a regular computation is regular itself. In other words,
multiple identical pieces of circuitry can be laid out so that they are all identical, and so
that they "tile" neatly. This regularity can be exploited so that only one copy of the
configuration data for the repeated circuitry needs to be loaded, and the single copy can
be distributed to the multiple locations where copies of the circuitry are to be placed.

In figure 18, a higher-level address decoder 34 receives a 4-bit address
A00, A01, A10, A11. The two more-significant bits A10, A11 are used to produce signals
S00, S01, S10, S11, each of which is supplied to a respective one of four lower-level
address decoders 52 labelled 00, 01, 10, 11, so that in normal operation only one of the
four decoders 52 is selected. The two less-significant bits A00, A01 are simply passed
to the four lower-level decoders 52. Each of the lower-level decoders 52, if selected by
its respective select signal Sxx, addresses a respective one of its four memory cells 42
in dependence upon the two-bit address A00, A01, and data D which is provided to all
of the lower-level decoders 52 is written to the addressed memory cell 42. Accordingly,
the data D is written to only one of the sixteen memory cells 42. Thus, in this mode of
operation, the decoding operation performed by the higher-level decoder 54 is defined
by the following:

    S00 = not(A10) and not(A11);
    S01 = A10 and not(A11);
    S10 = not(A10) and A11;
    S11 = A10 and A11.



For each of the more-significant address bits A10, A11 the higher-level decoder 54 also
receives a respective wild-card bit W10, W11. The higher-level decoder 54 is arranged
so that, if either of the wild-card bits W10, W11 is set, the respective address bit
A10, A11 is wild-carded. Thus, the decoding operation performed by the higher-level
decoder 54 becomes as follows:

    S00 = {not(A10) or W10} and {not(A11) or W11};
    S01 = {A10 or W10} and {not(A11) or W11};
    S10 = {not(A10) or W10} and {A11 or W11};
    S11 = {A10 or W10} and {A11 or W11}.

Accordingly, if the wild-card bit W10 is set, two of the memory cells 42, in the left half
and right half, respectively, of the array, will be addressed. If the wild-card bit W11 is
set, two of the memory cells 42 in the upper half and lower half, respectively, of the
array, will be addressed. Furthermore, if both of the wild-card bits W10, W11 are set,
four of the memory cells 42 to the top-left, top-right, bottom-left and bottom-right of the
array will be addressed. Therefore, it is possible to make multiple simultaneous writes
to the array in a single cycle.
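
The wild-carded decoding relations can be exercised with the following Python sketch,
given as an illustration only. The function names select_signals and write, the dictionary
used to model the sixteen memory cells, and the cell index A00 + 2*A01 within each
lower-level decoder are assumptions of the sketch; the description does not specify how the
two less-significant bits map onto individual cells.

```python
def select_signals(a10, a11, w10=0, w11=0):
    """Higher-level decode with wild-card bits, following the relations above."""
    def inv(x):
        return 1 - x

    return {
        "S00": (inv(a10) | w10) & (inv(a11) | w11),
        "S01": (a10 | w10) & (inv(a11) | w11),
        "S10": (inv(a10) | w10) & (a11 | w11),
        "S11": (a10 | w10) & (a11 | w11),
    }


def write(memory, a00, a01, a10, a11, data, w10=0, w11=0):
    """Write data to the addressed cell of every selected lower-level decoder.

    memory maps a decoder label ("00".."11") to its list of four cells.
    Assumption: the cell index within a decoder is a00 + 2 * a01.
    """
    cell = a00 + 2 * a01
    written = []
    for name, selected in select_signals(a10, a11, w10, w11).items():
        if selected:
            memory[name[1:]][cell] = data
            written.append(name[1:])
    return written


memory = {label: [0] * 4 for label in ("00", "01", "10", "11")}
single = write(memory, 1, 0, 0, 0, 0x9)                   # no wild cards: one cell
tiled = write(memory, 0, 1, 0, 0, 0x5, w10=1, w11=1)      # both wild cards: four cells
```

With no wild-card bit set exactly one select signal is active, with one bit set two are
active, and with both set all four are active, so one copy of the configuration data is
written to up to four locations in a single cycle.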
The arrangement described with respect to figure 18 may be modified. For example, the
wild-card bits may be associated with the address bits A00, A01 for the lower level
decoders 52, so that multiple writes may be made to the two memory cells 42 to the left,
to the right, above or below the addressed lower-level decoder 52, or to all four of the
memory cells 42 associated with the addressed lower-level decoder 52. Furthermore,
such wild-card bits may be associated with all of the address bits A00, A01, A10, A11 or
with only some of them.

The technique of multiple writes has been described with reference to a hierarchical
memory having only two levels of decoder, and it will be appreciated that the technique
may be used with hierarchical memory having more than two levels, and may be applied
to one, some or all of those levels.

In the arrangement described above with reference to figure 18, the width of the address
bus is increased by one for each address bit which is to have the capability of being
wild-carded. Nevertheless, the wild-card information can be modified on a cycle-by-cycle
basis without a performance penalty. In a modified arrangement, the wild-card
information is stored adjacent each decoder in a memory cell for each wild-card bit and is
pre-loaded into the wild-card memory cells. This reduces the cost of wiring, but requires
additional storage paths and operating cycles for changing the wild-carding information.

Many other modifications and developments may also be made. For example, in
arrangements like that of figure 18, the higher or highest level decoder 54 produces its
four output signals S00, S01, S10, S11 from four input signals A10, W10, A11, W11.
Alternatively, the four signals S00, S01, S10, S11 could be directly fed into the
arrangement for the same wiring cost and obviating the need for the higher or highest
level decoder.
CLAIMS



1. A data routing device, comprising:
    at least one connection matrix (14;40) for routing data in the device, the
connection matrix including a plurality of memory cells (24;42); and
    means for providing input to and/or output from said memory cells, wherein said
means includes a tree structure of input paths to and/or output paths from the memory
cells, and wherein each such path of the tree structure includes a set of branching
choices (34;52,54) along the tree structure.

2. A device as claimed in claim 1, wherein all of the paths include the same number
of branching choices.

3. A device as claimed in claim 1 or 2, wherein, at any level in the tree structure,
the number of branches at any branching choice at that level is equal to the number of
branches at the other branching choices at that level.

4. A device as claimed in any preceding claim, wherein the number of branches at
each branching choice is two, four or eight.

5. A device as claimed in any preceding claim, wherein all of the paths are of
substantially the same length.

6. A device as claimed in any preceding claim, wherein:
    the connection matrix, or at least one of the connection matrices, includes a
plurality of switches (16,18,20); and
    the memory cells (24) are operable to store data for controlling the switches to
define the configuration of the interconnections of that connection matrix.

7. A device as claimed in claim 6, wherein the connection matrix is arranged to
interconnect a plurality of processing devices (12) or gate arrays.

8. A device as claimed in any preceding claim, wherein the memory cells (42) of
the connection matrix, or at least one of the connection matrices, are arranged to receive
data in one format, to store the data temporarily, and to output the data in another
format.

9. A device as claimed in claim 8, wherein the memory cells are arranged as a
corner-turning memory (40).

10. A device as claimed in any preceding claim, wherein the tree structure provides
input paths (32;O,A0,A1,S00-S11) for addressing the memory cells, each branching
choice being provided by an address decoder (34;52,54) which, in response to an input
address from a higher level, is operable to sub-address less than all of the address
decoders or memory cells at the next lower level.

11. A device as claimed in claim 10, wherein at least one of the address decoders
(34) is operable to sub-address one of the address decoders or memory cells at the next
lower level.

12. A device as claimed in claim 10, wherein at least one of the address decoders
(52,54) is operable to sub-address two of the address decoders or memory cells at the
next lower level.

13. A device as claimed in any of claims 10 to 12, wherein, at at least one level of
the tree structure, the or each address decoder is operable to address a selectable number
of the address decoders at the next lower level.

14. A device as claimed in any preceding claim, wherein the tree structure provides
input and/or output paths (H00-H11,V00-V11,D00A-D11A,D00B-D11B,d00-d11) for
data to/from the memory cells, each branching choice being provided by a multiplexer
(52,54) which passes data from the next higher level to less than all of the multiplexers
or memory cells at the next lower level and/or which passes to the next higher level data
from less than all of the multiplexers or memory cells at the next lower level.